fix: resolve quiesceAndThenRestartMixedOps() flakiness #24567

Draft
AlexKehayov wants to merge 29 commits into main from 24398-quiescence-follow-up

@AlexKehayov AlexKehayov commented Mar 26, 2026

Description:
Follow-up to #24404.

Root Cause:
The original test's log assertion (assertHgcaaLogContains) searched the entire log file and trivially matched a transient startup quiescence event from node boot. It never validated that real sustained quiescence occurred during the test. The rare CI flake was the balance assertion failing due to tight timing margins.

Findings:

  • Transient vs real quiescence: The network produces ~5-6 quiescence transitions per run. All but one are transient (<1ms between "Started" and "Stopping"). The real one lasts ~22s - the heartbeat runs until it finds the TCT near the scheduled tx expiry.
  • Staking period transactions prevent quiescence: With staking.periodMins=1 (test default), staking transactions fire every 60s of consensus time, keeping the network active. Suppressing them is required for reliable quiescence.
  • overridingAllOf doesn't survive restart: It updates system file 0.0.121 in state, whereas overrides passed via restartAtNextConfigVersion(envOverrides) become process environment variables that take effect at startup, before the system file is processed.
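
The transient-versus-real distinction in the first finding can be illustrated with a small sketch. The class name QuiescenceIntervals, the method sustained, and the idea of feeding it pre-computed interval durations are illustrative assumptions, not code from this PR.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the PR): separates sustained quiescence
// intervals from transient flickers by how long each interval lasted.
final class QuiescenceIntervals {
    private QuiescenceIntervals() {}

    /**
     * Returns, in input order, the intervals that lasted at least minSustained.
     * Each input element is the time between a "Started" and a "Stopping" event.
     */
    static List<Duration> sustained(List<Duration> intervals, Duration minSustained) {
        List<Duration> result = new ArrayList<>();
        for (Duration interval : intervals) {
            if (interval.compareTo(minSustained) >= 0) {
                result.add(interval);
            }
        }
        return result;
    }
}
```

With sub-millisecond intervals and one 22-second interval as input, a 5-second threshold leaves only the 22-second interval, which is why a duration bound is enough to single out the real quiescence.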

Changes:

  • LogContainmentPairTimeframeOp.java (new) - Generic UtilOp that finds two log patterns in order within a timeframe, with a time gap in [minGap, maxGap]. Distinguishes real state transitions from transient flickers.
  • UtilVerbs.java - Added assertHgcaaLogContainsPairTimeframe() verb.
  • LifecycleTest.java - Added restartAtNextConfigVersion(Map<String, String> envOverrides) overload that passes env overrides through to the subprocess nodes.
  • QuiesceThenMixedOpsRestartTest.java - Uses restartAtNextConfigVersion with env overrides to suppress staking transactions. Replaced single-pattern log assertion with paired assertion (minGap=5s, maxGap=40s) that deterministically matches only the real sustained quiescence.
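
A minimal sketch of the paired log assertion idea follows. It is not the actual LogContainmentPairTimeframeOp: the class LogPairFinder, its method signature, the timestamp layout at the start of each line, and plain substring matching are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch (not the PR's implementation): finds the first pair of
// log lines where `first` is followed by `second` inside [from, to] and the
// time between them lies in [minGap, maxGap].
final class LogPairFinder {
    // Assumed log line prefix, e.g. "2026-03-26 10:00:00.000 ..."
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");
    private static final int TS_LEN = 23;

    private LogPairFinder() {}

    static Optional<Duration> findPairGap(
            List<String> lines, String first, String second,
            LocalDateTime from, LocalDateTime to,
            Duration minGap, Duration maxGap) {
        LocalDateTime firstSeen = null;
        for (String line : lines) {
            if (line.length() < TS_LEN) {
                continue; // too short to carry a timestamp
            }
            LocalDateTime ts;
            try {
                ts = LocalDateTime.parse(line.substring(0, TS_LEN), FMT);
            } catch (DateTimeParseException e) {
                continue; // e.g. a stack trace continuation line
            }
            if (ts.isBefore(from) || ts.isAfter(to)) {
                continue; // outside the timeframe of interest
            }
            if (line.contains(first)) {
                firstSeen = ts; // remember the latest opening event
            } else if (firstSeen != null && line.contains(second)) {
                Duration gap = Duration.between(firstSeen, ts);
                if (gap.compareTo(minGap) >= 0 && gap.compareTo(maxGap) <= 0) {
                    return Optional.of(gap);
                }
                firstSeen = null; // pair closed with an out-of-range gap; keep scanning
            }
        }
        return Optional.empty();
    }
}
```

Restricting the search to a timeframe excludes the startup event that the whole-file search matched, and the lower gap bound rejects the transient flickers, so with minGap=5s and maxGap=40s only a sustained transition can satisfy the assertion.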

Fixes #24398

@AlexKehayov AlexKehayov self-assigned this Mar 26, 2026
@AlexKehayov AlexKehayov requested review from a team as code owners March 26, 2026 21:58
lfdt-bot commented Mar 26, 2026

Snyk checks have passed. No issues have been found so far.

Status | Scan Engine | Critical | High | Medium | Low | Total (0)
 | Open Source Security | 0 | 0 | 0 | 0 | 0 issues


gkozyryatskyy previously approved these changes Mar 26, 2026

@gkozyryatskyy gkozyryatskyy left a comment

LGTM! Thank you @AlexKehayov !

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main   #24567      +/-   ##
============================================
+ Coverage     78.23%   78.29%   +0.06%     
+ Complexity    11692    11688       -4     
============================================
  Files          2490     2486       -4     
  Lines         95107    94900     -207     
  Branches      10293    10280      -13     
============================================
- Hits          74406    74305     -101     
+ Misses        16960    16855     -105     
+ Partials       3741     3740       -1     

see 19 files with indirect coverage changes

github-actions bot commented Mar 26, 2026

Node: HAPI Test (Misc) Results

  160 files  ±0      3 errors  157 suites  ±0   1h 16m 46s ⏱️ - 16m 18s
  170 tests ±0    168 ✅ ±0  2 💤 ±0  0 ❌ ±0 
1 222 runs  ±0  1 220 ✅ ±0  2 💤 ±0  0 ❌ ±0 

For more details on these parsing errors, see this check.

Results for commit 797302a. ± Comparison against base commit bc67352.

♻️ This comment has been updated with latest results.

…to fix quiescence

Signed-off-by: Alex Kehayov <aleks.kehayov@limechain.tech>
@AlexKehayov AlexKehayov force-pushed the 24398-quiescence-follow-up branch from 797302a to d2df7c6 Compare March 28, 2026 02:02
AlexKehayov and others added 12 commits March 28, 2026 04:29
Signed-off-by: Alex Kehayov <aleks.kehayov@limechain.tech>
@AlexKehayov AlexKehayov force-pushed the 24398-quiescence-follow-up branch from f78b067 to 170aca3 Compare March 30, 2026 01:06
AlexKehayov and others added 4 commits March 30, 2026 10:58
Signed-off-by: Alex Kehayov <aleks.kehayov@limechain.tech>
codacy-production bot commented Mar 31, 2026

Not up to standards ⛔

🟢 Coverage ∅ diff coverage · 0.00% coverage variation

Metric | Results
Coverage variation | 0.00% coverage variation (-1.00%)
Diff coverage | ∅ diff coverage

Coverage variation details
Commit | Coverable lines | Covered lines | Coverage
Common ancestor commit (3d9574f) | 94803 | 77996 | 82.27%
Head commit (583cd04) | 94803 (+0) | 77994 (-2) | 82.27% (0.00%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Scope | Coverable lines | Covered lines | Diff coverage
Pull request (#24567) | 0 | 0 | ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


# Conflicts:
#	hedera-node/test-clients/build.gradle.kts
Signed-off-by: Alex Kehayov <aleks.kehayov@limechain.tech>
@AlexKehayov AlexKehayov force-pushed the 24398-quiescence-follow-up branch from 54942cf to 9f64026 Compare March 31, 2026 13:51
@AlexKehayov AlexKehayov changed the title from "fix: increase assertion window in quiesceAndThenRestartMixedOps()" to "fix: resolve quiesceAndThenRestartMixedOps() flakiness" Apr 2, 2026
github-actions bot commented Apr 2, 2026

⚠️ New Flaky Test(s) Detected — Action Required

New flaky tests were detected in this run. New tickets have been created and require your attention — please investigate whether the flakiness was introduced by changes in this PR.

Test | Ticket
com.hedera.services.bdd.suites.integration.BlockNodeRewardsTests#activeNodeWithOneRegisteredBlockNodeGetsConsensusAndBlockReward | #24711 (🆕 New)

Please review the linked tickets and determine if any of the new issues were caused by your changes.

Development

Successfully merging this pull request may close these issues.

Test flake: JustQuiesceTest > justQuiesce()

3 participants